Skip to content

fix(data): prompt-length filtering crashes on VLM dataset with apply_chat_template#2126

Open
Meihan-chen wants to merge 2 commits into
THUDM:mainfrom
Meihan-chen:fix/multimodal-length-filter
Open

fix(data): prompt-length filtering crashes on VLM dataset with apply_chat_template#2126
Meihan-chen wants to merge 2 commits into
THUDM:mainfrom
Meihan-chen:fix/multimodal-length-filter

Conversation

@Meihan-chen

@Meihan-chen Meihan-chen commented Jun 23, 2026

Copy link
Copy Markdown

Repro

Prompt-length filtering crashes for multimodal (VLM) datasets when --apply-chat-template is set, making the feature unusable. Reproduced end-to-end on the geo3k VLM example with --apply-chat-template + --rollout-max-prompt-len:

slime/utils/data.py:111, in filter_long_prompt
    multimodal_inputs = process_vision_info(sample.prompt, processor)
TypeError: string indices must be integers, not 'str'

Root cause

In filter_long_prompt (slime/utils/data.py), the multimodal branch re-derived vision info from sample.prompt:

multimodal_inputs = process_vision_info(sample.prompt, processor)
processor_output = processor(text=sample.prompt, **multimodal_inputs)

With apply_chat_template=True, Sample.prompt is the rendered string, but filter_long_prompt passed it to process_vision_info, which expects a conversation list → crash. The vision inputs are already computed and stored in Sample.multimodal_inputs, so this recomputation is both wrong and redundant.

Why it's easy to hit

Setting --rollout-max-context-len (which derives rollout_max_prompt_len), --rollout-max-prompt-len, or --eval-max-prompt-len activates the filter on a VLM dataset and trips the crash.

Fix

Reuse the multimodal inputs already stored on the sample, routed through the same build_processor_kwargs helper the rollout path (sglang_rollout) uses, so the token length measured during filtering matches the real pipeline:

processor_kwargs = build_processor_kwargs(sample.multimodal_inputs)
processor_output = processor(text=sample.prompt, **processor_kwargs)

filter_long_prompt re-extracted vision info from sample.prompt via
process_vision_info in the multimodal branch. When apply_chat_template
is set, sample.prompt is the rendered *string* (not a conversation
list), so process_vision_info -> qwen_vl_utils crashed with
"TypeError: string indices must be integers, not 'str'".

This made prompt-length filtering unusable for any VLM dataset: setting
--rollout-max-context-len (which derives rollout_max_prompt_len) or
--rollout-max-prompt-len / --eval-max-prompt-len activates the filter
and hits the crash.

Reuse the multimodal inputs already computed during dataset
construction via build_processor_kwargs (matching the sglang_rollout
path) instead of recomputing them from the string prompt.

Add CPU unit tests covering the multimodal branch and a mixed
text-only + multimodal dataset.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
@Meihan-chen Meihan-chen changed the title fix(data): reuse stored multimodal_inputs in filter_long_prompt to fix VLM length-filtering crash fix(data): prompt-length filtering crashes on any VLM dataset with apply_chat_template Jun 23, 2026
@Meihan-chen Meihan-chen force-pushed the fix/multimodal-length-filter branch from 7447cd0 to a4560f6 Compare June 23, 2026 10:00
@Meihan-chen Meihan-chen changed the title fix(data): prompt-length filtering crashes on any VLM dataset with apply_chat_template fix(data): prompt-length filtering crashes on VLM dataset with apply_chat_template Jun 23, 2026
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant